class: center, middle, inverse, title-slide

# Lecture 10
## Multiple Groups
### Psych 10 C
### University of California, Irvine
### 04/20/2022

---
## Hypotheses as models

- When we have one independent variable that can take more than 2 values, comparing a Null model, which assumes that every group has the same expected value, with the Effects model, which assumes that every group has a different expected value, is not enough to answer our research question.

--

- This is because the Effects model can be better than the Null in multiple ways.

--

- For example, the Effects model will account for the data better if it is true that:

--

  1. Every group is different: `\(\mu_1 \neq \mu_2 \neq \mu_3\)`.
  2. At least one group is different: `\(\mu_1 = \mu_2 \neq \mu_3\)`.

--

- Both of the previous statements are consistent with the main assumption of the Effects model, which in isolation can only tell us that at least one group is different.

---
## Hypotheses as models

- As we showed with the anxiety example, we can reach an answer to our research question by comparing multiple models, each of which assumes that one group is different from the rest. We do this with each group, one at a time.

--

- When our independent variable can take only 3 different values, this resulted in us having to look at the predictions and errors of 5 different models.

--

- If we increase the number of groups (or values of our independent variable), this number grows rapidly. For example, with 4 groups the number of models that we could compare is 9, and with 5 it grows to 17.

--

- We would prefer not to have to look at so many models every time we have more than two groups.

--

- One way to reduce the number of models that we look at is by making our research question more precise.
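A small sketch of where those counts come from, assuming we count the Null model, the Effects model, and every way of splitting the `\(k\)` groups into two sets with different expected values:

```r
# Null + Effects + every two-way split of the k groups
# (there are 2^(k - 1) - 1 such splits)
n_models <- function(k) 2^(k - 1) + 1

n_models(3)  # 5
n_models(4)  # 9
n_models(5)  # 17
```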
---
## A better approach to comparing anxiety levels

- Going back to our anxiety levels example, instead of asking the question:

--

- Are there any differences between the anxiety levels of First year students in the 2018, 2019, and 2020 cohorts?

--

- We could ask the question:

--

- Are the anxiety levels of students from the 2020 cohort different from students in 2019 and 2018?

--

- Notice that our new question is now centered around First year students in the 2020 cohort and not on differences between all cohorts.

--

- In order to answer this question we only need to look at 3 models: first the Null, then the Effects, and finally the 2020 model from last class.

--

- Comparing the Null and Effects models will tell us if there is at least one group that is different. Then our 2020 model will answer our actual question: Are the anxiety levels of First year students in the 2020 cohort different from students in the 2018 and 2019 cohorts?

---
## Hypotheses and models

- Thus, just by redefining our experimental question we can reduce the number of comparisons that we have to make.

--

- In the previous example our second question makes more sense, as there is no real reason to assume that 2018 and 2019 should be different; however, due to the pandemic, we would expect First year students from the 2020 cohort to be more anxious than others.

--

- One important thing to keep in mind is that we don't look for hypotheses that will reduce the number of models that we need to compare. A hypothesis should always come first.

--

- In other words, we perform an experiment because we have a hypothesis regarding its outcomes. This hypothesis will also inform our model comparisons.

--

- As we mentioned before, this only becomes a problem when we have multiple groups and not when there are only two.
---
class: inverse, center, middle

# Example
## New treatment for blood pressure

---
# Research problem:

- A pharmaceutical company is about to launch a new drug as a treatment for high blood pressure.

--

- Before that, they have to show that their treatment works better than a control substance (placebo) and is at least as good as the current standard of care (standard) treatment.

--

- They have designed an experiment where participants are randomly assigned to one of three groups. There will be 25 participants assigned to each group for a total of 75.

--

- Given our objective, which models would you consider for a comparison?

.can-edit.key-likes[
Model 1:

Model 2:

Model 3:

Model 4:

Model 5:
]

---
## Experiment

- The experimenters have recorded the blood pressure of each participant after 2 weeks of treatment. They have obtained the following data:
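The data table itself is not reproduced in these notes, but a `drug_trial` data frame with the same structure could be simulated as follows; the group means echo the values reported later in the lecture, and the common SD is an illustrative assumption (not the real data):

```r
# minimal sketch: 75 participants, 25 per condition
# (means and SD are assumptions for illustration only)
set.seed(10)
drug_trial <- data.frame(
  condition = rep(c("control", "standard", "new_drug"), each = 25),
  blood_pressure = c(rnorm(25, mean = 136.6, sd = 5.2),
                     rnorm(25, mean = 130.0, sd = 5.2),
                     rnorm(25, mean = 120.8, sd = 5.2)))
head(drug_trial)
```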
---
## Visualizing data

- Another way in which we can visualize our data is by constructing a combination between a box-plot and a histogram:

.pull-left[

```r
ggplot(data = drug_trial) +
  aes(x = condition) +
  aes(y = blood_pressure) +
  aes(color = condition) +
  geom_boxplot() +
  geom_dotplot(binaxis = 'y', stackdir = 'center',
               dotsize = 0.7, method = "histodot",
               col = "#55555555", fill = "#55555555",
               binwidth = 2) +
  theme_classic() +
  xlab("Condition") +
  ylab("Blood pressure") +
  guides(fill = guide_legend("Exam"), color = "none") +
  theme(axis.title.x = element_text(size = 25),
        axis.title.y = element_text(size = 25))
```
]

.pull-right[
<img src="data:image/png;base64,#lec-10_files/figure-html/hist-score-out-1.png" style="display: block; margin: auto;" />
]

---
## Models

- We will start by evaluating our two usual models, the Null and Effects models.

--

- **Null model:** There are no differences in blood pressure levels between participants, regardless of treatment.

`$$y_{ij}\sim\text{Normal}(\mu,\sigma_0^2)$$`

--

- **Effects model:** At least one of the three conditions has a different expected blood pressure level than the others.

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

---
## Null model

- As in previous examples, first we want to calculate the predictions of the model, the Sum of Squared Errors (SSE), and the Mean Squared Error (MSE) of the Null.

.pull-left[

```r
# sample size
n_total <- nrow(drug_trial)

# add null prediction and error
drug_trial <- drug_trial %>%
  mutate("prediction_null" = mean(blood_pressure),
         "error_null" = (blood_pressure - prediction_null)^2)

# calculate sum of squared errors
sse_null <- sum(drug_trial$error_null)

# calculate mean squared error
mse_null <- 1/n_total * sse_null

# calculate bic
bic_null <- n_total * log(mse_null) + 1 * log(n_total)
```
]

.pull-right[
- Prediction = 129.16
- SSE = 5135
- MSE = 68.47
- BIC = 321.29
]

---
## Effects model

- Now we have to calculate the predictions, SSE and MSE associated with the Effects model.
Remember that the Effects model assumes that each group follows its own normal distribution:

--

.pull-left[

```r
# predictions by group
effects <- drug_trial %>%
  group_by(condition) %>%
  summarise("pred" = mean(blood_pressure))

# add effects prediction and error
drug_trial <- drug_trial %>%
  mutate("prediction_effects" =
           case_when(condition == "control" ~ effects$pred[1],
                     condition == "standard" ~ effects$pred[3],
                     condition == "new_drug" ~ effects$pred[2]),
         "error_effects" = (blood_pressure - prediction_effects)^2)

# calculate sum of squared errors
sse_effects <- sum(drug_trial$error_effects)

# calculate mean squared error
mse_effects <- 1/n_total * sse_effects

# calculate bic
bic_effects <- n_total * log(mse_effects) + 3 * log(n_total)
```
]

.pull-right[
- Prediction:
  - control = 136.62
  - standard = 130.03
  - new drug = 120.83
- SSE = 1989.31
- MSE = 26.52
- `\(R^2\)` = 0.61
- BIC = 258.81
]

---
## New drug model

- For this model we assume that only the expectation of the new drug group is different from the rest. We can express the new drug group's distribution as:

`$$y_{i3}\sim\text{Normal}(\mu_3,\sigma_3^2)$$`

--

- This is the model that we are interested in!

--

- We evaluate this model as the others, by calculating its predictions and error.
---
## New drug model

.pull-left[

```r
# create an indicator variable
drug_trial <- drug_trial %>%
  mutate("id_newdrug" = ifelse(test = condition == "new_drug",
                               yes = "new", no = "other"))

# predictions by group
new_drug <- drug_trial %>%
  group_by(id_newdrug) %>%
  summarise("pred" = mean(blood_pressure))

# add new drug prediction and error
drug_trial <- drug_trial %>%
  mutate("prediction_newdrug" =
           case_when(id_newdrug == "other" ~ new_drug$pred[2],
                     id_newdrug == "new" ~ new_drug$pred[1]),
         "error_newdrug" = (blood_pressure - prediction_newdrug)^2)

# calculate sum of squared errors
sse_newdrug <- sum(drug_trial$error_newdrug)

# calculate mean squared error
mse_newdrug <- 1/n_total * sse_newdrug

# calculate bic
bic_newdrug <- n_total * log(mse_newdrug) + 2 * log(n_total)
```
]

.pull-right[
- Prediction:
  - control = 133.33
  - standard = 133.33
  - new drug = 120.83
- SSE = 2532.5
- MSE = 33.77
- `\(R^2\)` = 0.51
- BIC = 272.6
]

---
## Results

- From our results we can see that the best model was the Effects model; furthermore, the **new drug** model was also better at accounting for our observations than the Null model was.

--

- From these comparisons we can safely answer the question that we had: Is the new drug better than a control substance and at least as good as the standard of care?

--

- The results of our model comparisons show that there is at least one group that had a different expected level of blood pressure, and at the same time, that assuming that the new drug is different from the standard of care and the control is better than assuming that the 3 groups are equal.

--

- This suggests that the blood pressure levels of participants that used the new drug are different from the control and at least equal to the standard of care.
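The comparison above comes down to the BIC values computed on the previous slides; a minimal sketch, assuming `bic_null`, `bic_effects`, and `bic_newdrug` from the earlier chunks are in the workspace (a lower BIC indicates a better trade-off between fit and complexity):

```r
# collect the three BIC values computed earlier
bics <- c(null = bic_null,         # 321.29
          effects = bic_effects,   # 258.81
          new_drug = bic_newdrug)  # 272.60

# best model overall
names(which.min(bics))  # "effects"

# both alternatives beat the Null
bics["effects"] < bics["null"]
bics["new_drug"] < bics["null"]
```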
---
## Results

- From our model comparisons we could conclude that:

--

- Participants that used the new drug for 2 weeks had an average blood pressure of 120.83, while participants using the standard of care drug and a control substance had blood pressures of 130.03 and 136.62, respectively.

--

- Given that the model that assumes that at least one group is different and the one that assumes that only the new drug group is different are both better than the Null model, we can conclude that the new drug is better than the control and at least as good as the standard of care.

--

- The Effects model accounted for 61% of the total variation in our observations.

---
## Visualizing model fit

- Once we have selected a model we can visualize how close its predictions are to our observations. This is considered a visualization of the accuracy or fit of the model.

--

- Although we can't use this as a method for model comparison, it gives the reader information regarding how good the model predictions are in the context of the experiment.

--

- As we said at the beginning of the course, statistics are functions of random variables (our experimental outcomes) and as such they follow a probability distribution.

--

- Our estimators for the expected value of each group's distribution (the average of each group) are statistics, and therefore, they follow their own probability distribution.

--

- The good thing is that, given the assumptions in our models (observations follow a Normal distribution), the distribution of our estimator is also Normal!

---
## Adding error bars to model predictions

- To add error bars to our estimators (AKA confidence intervals) we have to take into account their distribution.
--

- For the model that we selected using our model comparison approach, we will build an approximate confidence interval for the prediction of each group using the following equation:

`$$\hat{\mu}_j \pm 1.96 \times \sqrt{\frac{\hat{\sigma}^2}{n_j}}$$`

--

- Where the value `\(1.96\)` comes from the normal distribution: it's the value at which the normal accumulates `\(0.975\)` of its probability. `\(\hat{\mu}_j\)` is the prediction for group `\(j\)`, and `\(\hat{\sigma}^2\)` is the Mean Squared Error of the model (its SSE divided by the total sample size).

--

- The value of `\(n_j\)` is equal to the number of observations that we used to calculate each prediction. For example, with the Effects model we used 25 observations to calculate each `\(\hat{\mu}_j\)`, which means that we would use `\(n_j = 25\)`, and we have 3 predictions, so `\(k = 3\)`.

---
## Adding error bars to model predictions

- In comparison, with the new drug model we used only 2 predictions, so `\(n_{new}=25\)` for the new drug condition and `\(n_{other} = 50\)` for the other conditions. Therefore, we need to calculate the sample sizes according to the model we selected.

- In order to visualize our model's predictions in R, first we have to create a new data frame with the variables that we need.
--

```r
# create a new data frame with prediction by group and error
confint_effects <- drug_trial %>%
  group_by(condition) %>%
  summarise("sample_size" = n(),
            "prediction" = mean(blood_pressure),
            "standard_error" = 1.96 * sqrt(mse_effects/sample_size))
```

---
## Visualizing model fit:

.pull-left[

```r
ggplot(data = confint_effects) +
  aes(x = condition) +
  aes(y = prediction) +
  aes(color = condition) +
  geom_pointrange(aes(ymin = prediction - standard_error,
                      ymax = prediction + standard_error,
                      color = condition),
                  position = position_dodge(0.3),
                  fatten = 2, size = 2.8) +
  geom_dotplot(data = drug_trial,
               binaxis = 'y', stackdir = 'center',
               dotsize = 0.5, method = "histodot",
               col = "#555555", fill = "#555555",
               alpha = 0.3, binwidth = 2,
               mapping = aes(x = condition, y = blood_pressure)) +
  theme_classic() +
  xlab("Condition") +
  ylab("Blood pressure") +
  guides(fill = guide_legend("Exam"), color = "none") +
  theme(axis.title.x = element_text(size = 25),
        axis.title.y = element_text(size = 25))
```
]

.pull-right[
<img src="data:image/png;base64,#lec-10_files/figure-html/eff-fit-out-1.png" style="display: block; margin: auto;" />
]

---
## Homework data link:

```r
link <- "https://raw.githubusercontent.com/ManuelVU/psych-10c-data/main/homework3-p1.csv"
```
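For completeness, the same recipe applied to the new drug model would group by the indicator variable instead, so that `\(n_{new} = 25\)` and `\(n_{other} = 50\)`. A sketch, assuming the `id_newdrug` column and `mse_newdrug` from the earlier chunks:

```r
# same recipe for the new drug model: group sizes differ
# (n = 25 for "new", n = 50 for "other");
# 1.96 is approximately qnorm(0.975)
confint_newdrug <- drug_trial %>%
  group_by(id_newdrug) %>%
  summarise("sample_size" = n(),
            "prediction" = mean(blood_pressure),
            "standard_error" = 1.96 * sqrt(mse_newdrug/sample_size))
```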